ELF: An End-to-end Local and Global Multimodal Fusion Framework for Glaucoma Grading
Glaucoma is a chronic neurodegenerative condition that can lead to blindness.
Early detection and treatment are essential to keep the disease from worsening
in glaucoma patients. Both 2D fundus images and optical coherence tomography
(OCT) help ophthalmologists diagnose glaucoma. Many methods are based on fundus
images or 3D OCT volumes alone; however, mining multi-modal information from
both fundus images and OCT data is less studied. In this work, we propose an
end-to-end local and global multi-modal fusion framework for glaucoma grading,
named ELF for short. ELF can fully exploit the complementary information
between fundus and OCT. In addition, unlike previous methods that simply
concatenate multi-modal features and thus fail to explore the mutual
information between modalities, ELF takes advantage of both local-wise and
global-wise mutual information. Extensive experiments on the multi-modal GAMMA
glaucoma grading dataset demonstrate the effectiveness of ELF compared with
other state-of-the-art methods.
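The abstract leaves the fusion mechanism at a high level, so the local-wise/global-wise idea can be illustrated with a minimal PyTorch sketch: cross-attention between modality tokens provides local fusion, and pooled descriptors provide global fusion. All module names, token shapes, and the three-grade head below are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    # Toy local+global fusion of fundus and OCT token features.
    def __init__(self, dim=256, heads=4, grades=3):
        super().__init__()
        # Local-wise fusion: fundus tokens attend to OCT tokens.
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Global-wise fusion: combine pooled per-modality descriptors.
        self.global_fc = nn.Linear(2 * dim, dim)
        self.head = nn.Linear(2 * dim, grades)  # hypothetical grading head

    def forward(self, fundus_tok, oct_tok):
        # fundus_tok, oct_tok: (B, N, dim) tokens from two separate encoders.
        local, _ = self.local_attn(fundus_tok, oct_tok, oct_tok)
        local = local.mean(dim=1)                        # pooled local fusion
        pooled = torch.cat([fundus_tok.mean(1), oct_tok.mean(1)], dim=-1)
        global_ = self.global_fc(pooled)                 # pooled global fusion
        return self.head(torch.cat([local, global_], dim=-1))

fusion = CrossModalFusion()
logits = fusion(torch.randn(2, 49, 256), torch.randn(2, 64, 256))
print(logits.shape)  # torch.Size([2, 3])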
Depth Assisted Full Resolution Network for Single Image-based View Synthesis
Research on novel viewpoint synthesis has mainly focused on interpolation from
multi-view input images. In this paper, we address the more challenging and
ill-posed problem of synthesizing novel viewpoints from a single input image.
To achieve this goal, we propose a novel deep learning-based technique. We
design a full resolution network that extracts local image features at the
same resolution as the input, which helps preserve high resolution and prevent
blurry artifacts in the final synthesized images. We also incorporate a
pre-trained depth estimation network into our system, so that 3D information
can be used to infer the flow field between the input and the target image.
Since the depth network is trained on depth-order information between
arbitrary pairs of points in the scene, global image features are incorporated
into our system as well. Finally, a synthesis layer is used both to warp the
observed pixels to the desired positions and to hallucinate the missing pixels
from recorded pixels. Experiments show that our technique performs well on
images of various scenes and outperforms the state-of-the-art techniques.
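As a rough illustration of the warping step described above, a depth-derived flow field can be applied with a differentiable sampler. This is a standard building block rather than the paper's exact synthesis layer; the helper below and its shapes are assumptions.

import torch
import torch.nn.functional as F

def warp(image, flow):
    # image: (B, C, H, W); flow: (B, 2, H, W) pixel offsets inferred from depth.
    B, _, H, W = image.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().unsqueeze(0)  # (1, 2, H, W)
    coords = grid + flow                  # where to sample each output pixel
    # Normalize coordinates to [-1, 1], the range grid_sample expects.
    coords[:, 0] = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords[:, 1] = 2.0 * coords[:, 1] / (H - 1) - 1.0
    return F.grid_sample(image, coords.permute(0, 2, 3, 1), align_corners=True)

novel = warp(torch.rand(1, 3, 64, 64), torch.zeros(1, 2, 64, 64))  # zero flow: identity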
Towards Ghost-free Shadow Removal via Dual Hierarchical Aggregation Network and Shadow Matting GAN
Shadow removal is an essential task for scene understanding. Many studies
consider only matching the image contents, which often causes two types of
ghosts: color inconsistencies in shadow regions or artifacts on shadow
boundaries. In this paper, we tackle these issues in two ways. First, to learn
a boundary-artifact-free image carefully, we propose a novel network structure
named the dual hierarchical aggregation network (DHAN). It contains a series
of dilated convolutions with growing rates as the backbone, without any
down-sampling, and we hierarchically aggregate multi-context features for
attention and prediction, respectively. Second, we argue that training on a
limited dataset restricts the textural understanding of the network, which
leads to color inconsistencies in shadow regions. Currently, the largest
dataset contains 2k+ shadow/shadow-free image pairs. However, it has only
0.1k+ unique scenes, since many samples share exactly the same background with
different shadow positions. Thus, we design a shadow matting generative
adversarial network (SMGAN) to synthesize realistic shadow mattings from a
given shadow mask and shadow-free image. With the help of novel masks or
scenes, we enhance the current datasets using synthesized shadow images.
Experiments show that our DHAN can erase the shadows and produce high-quality
ghost-free images. After training on the synthesized and real datasets, our
network outperforms other state-of-the-art methods by a large margin.
The code is available at: http://github.com/vinthony/ghost-free-shadow-removal/
Comment: Accepted by AAAI 2020.
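A minimal sketch of the backbone idea, dilated convolutions with growing rates at full resolution plus hierarchical aggregation of multi-context features, might look as follows; the channel widths, dilation rates, and single aggregation head are illustrative assumptions, not the released DHAN code.

import torch
import torch.nn as nn

class DilatedBackbone(nn.Module):
    def __init__(self, ch=32, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.stem = nn.Conv2d(3, ch, 3, padding=1)
        # Growing dilation rates enlarge the receptive field without down-sampling.
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch, ch, 3, padding=d, dilation=d), nn.ReLU())
            for d in dilations
        )
        # Aggregate every context level for the final prediction.
        self.agg = nn.Conv2d(ch * len(dilations), 3, 1)

    def forward(self, x):
        feat = self.stem(x)
        contexts = []
        for blk in self.blocks:
            feat = blk(feat)
            contexts.append(feat)                    # keep each context level
        return self.agg(torch.cat(contexts, dim=1))

net = DilatedBackbone()
out = net(torch.rand(1, 3, 128, 128))  # output keeps the full input resolution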
Locality Preserving Multiview Graph Hashing for Large Scale Remote Sensing Image Search
Hashing is widely used for remote sensing image search. This article proposes
a multiview hashing method with learnable parameters to retrieve queried
images from a large-scale remote sensing dataset. Existing methods tend to
neglect that real-world remote sensing data lie on a low-dimensional manifold
embedded in a high-dimensional ambient space. Unlike previous methods, this
article proposes to learn consensus compact codes in a view-specific
low-dimensional subspace. Furthermore, we add a hyperparameter-learnable
module to avoid complex parameter tuning. To demonstrate the effectiveness of
our method, we carried out experiments on three widely used remote sensing
datasets and compared our method with seven state-of-the-art methods.
Extensive experiments show that the proposed method achieves competitive
results compared to the other methods. Comment: 5 pages, ICASSP accepted.
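The locality-preserving idea can be sketched with a toy graph-Laplacian penalty that pushes graph neighbors toward similar (relaxed) codes before binarization; the projection, affinity matrix, and loss below are illustrative assumptions, not the article's actual optimization scheme.

import torch

def hash_codes(X, W):
    # X: (n, d) view features; W: (d, b) learnable projection to b bits.
    return torch.sign(X @ W)                 # binary codes in {-1, +1}

def locality_loss(X, W, A):
    # A: (n, n) symmetric affinity of a neighborhood graph.
    Y = X @ W                                # relaxed (pre-sign) codes
    L = torch.diag(A.sum(dim=1)) - A         # graph Laplacian
    return torch.trace(Y.T @ L @ Y)          # small when neighbors agree

X = torch.randn(100, 64)
W = torch.randn(64, 32, requires_grad=True)
A = torch.rand(100, 100)
A = (A + A.T) / 2                            # toy symmetric affinity
locality_loss(X, W, A).backward()            # drives W toward local smoothness
codes = hash_codes(X, W.detach())            # final binary codes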
High-Resolution Document Shadow Removal via A Large-Scale Real-World Dataset and A Frequency-Aware Shadow Erasing Net
Shadows often occur when we capture documents with casual equipment, which
degrades the visual quality and readability of the digital copies. Unlike
algorithms for natural shadow removal, algorithms for document shadow removal
need to preserve the details of fonts and figures in high-resolution input.
Previous works ignore this problem and remove shadows via approximate
attention and small datasets, which might not work in real-world situations.
We handle high-resolution document shadow removal directly via a large-scale
real-world dataset and a carefully designed frequency-aware network. As for
the dataset, we acquire over 7k pairs of high-resolution (2462 x 3699)
real-world document images covering various samples under different lighting
circumstances, which is 10 times larger than existing datasets. As for the
network design, we decouple the high-resolution images in the frequency
domain, where the low-frequency details and high-frequency boundaries can be
effectively learned via the carefully designed network structure. Powered by
our network and dataset, the proposed method clearly outperforms previous
methods in terms of visual quality and numerical results. The code, models,
and dataset are available at: https://github.com/CXH-Research/DocShadow-SD7K
Comment: Accepted by International Conference on Computer Vision 2023 (ICCV
2023).
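One simple way to realize the frequency decoupling described above is a downsample/upsample split: the blurred pass carries the low-frequency shading while the residual keeps high-frequency strokes and boundaries. The sketch below illustrates the general idea under that assumption; it is not the paper's network.

import torch
import torch.nn.functional as F

def decouple(x, scale=0.25):
    # x: (B, 3, H, W) high-resolution document image.
    low = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)
    low_up = F.interpolate(low, size=x.shape[-2:], mode="bilinear", align_corners=False)
    high = x - low_up            # fonts, figure edges, shadow boundaries
    return low, high             # each branch can be learned separately

x = torch.rand(1, 3, 512, 384)
low, high = decouple(x)
up = F.interpolate(low, size=x.shape[-2:], mode="bilinear", align_corners=False)
assert torch.allclose(up + high, x, atol=1e-5)  # the split is lossless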
A Large-scale Film Style Dataset for Learning Multi-frequency Driven Film Enhancement
Film, a classic image style, is culturally significant to the whole
photographic industry, since it marks the birth of photography. However, film
photography is time-consuming and expensive, which calls for a more efficient
way of collecting film-style photographs. The numerous datasets that have
emerged in the field of image enhancement so far are not film-specific. To
facilitate research on film-based image stylization, we construct FilmSet, a
large-scale and high-quality film-style dataset. Our dataset includes three
different film types and more than 5000 in-the-wild high-resolution images.
Inspired by the characteristics of FilmSet images, we propose a novel
Laplacian-pyramid-based framework called FilmNet that stylizes images across
frequency bands to achieve film-style results. Experiments reveal that our
model outperforms state-of-the-art techniques. Our dataset and code will be
made publicly available.
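The Laplacian pyramid split that enables band-wise stylization can be sketched as follows; the per-band enhancement networks are omitted, and the pooling and upsampling choices are assumptions rather than FilmNet's exact operators.

import torch
import torch.nn.functional as F

def build_laplacian_pyramid(x, levels=3):
    pyr, cur = [], x
    for _ in range(levels - 1):
        down = F.avg_pool2d(cur, 2)
        up = F.interpolate(down, size=cur.shape[-2:], mode="bilinear", align_corners=False)
        pyr.append(cur - up)     # band-pass detail at this scale
        cur = down
    pyr.append(cur)              # coarsest low-frequency residual
    return pyr

def reconstruct(pyr):
    cur = pyr[-1]
    for band in reversed(pyr[:-1]):
        cur = F.interpolate(cur, size=band.shape[-2:], mode="bilinear", align_corners=False) + band
    return cur

x = torch.rand(1, 3, 256, 256)
assert torch.allclose(reconstruct(build_laplacian_pyramid(x)), x, atol=1e-5)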
Explicit Visual Prompting for Universal Foreground Segmentations
Foreground segmentation is a fundamental problem in computer vision that
includes salient object detection, forgery detection, defocus blur detection,
shadow detection, and camouflaged object detection. Previous works have
typically relied on domain-specific solutions to address accuracy and
robustness issues in those applications. In this paper, we present a unified
framework for a number of foreground segmentation tasks without any
task-specific designs. We take inspiration from the widely used pre-training
and prompt-tuning protocols in NLP and propose a new visual prompting model
named Explicit Visual Prompting (EVP). Unlike previous visual prompting, which
is typically a dataset-level implicit embedding, our key insight is to focus
the tunable parameters on the explicit visual content of each individual
image, i.e., the features from frozen patch embeddings and the high-frequency
components. Our method freezes a pre-trained model and then learns
task-specific knowledge using a few extra parameters. Despite introducing only
a small number of tunable parameters, EVP achieves superior performance to
full fine-tuning and other parameter-efficient fine-tuning methods.
Experiments on fourteen datasets across five tasks show that the proposed
method outperforms other task-specific methods while being considerably
simpler. The proposed method also demonstrates scalability across different
architectures, pre-trained weights, and tasks. The code is available at:
https://github.com/NiFangBaAGe/Explicit-Visual-Prompt. Comment: arXiv admin
note: substantial text overlap with arXiv:2303.1088
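A hedged sketch of the prompting recipe, freezing a pre-trained patch embedding and tuning only a small adapter fed with high-frequency content, is given below; the FFT-based high-pass filter, the stand-in backbone, and the 1x1 adapter are illustrative assumptions, not EVP's released code.

import torch
import torch.nn as nn

def high_frequency(x, ratio=0.25):
    # Zero out the low-frequency center of the spectrum, keep the rest.
    f = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    H, W = x.shape[-2:]
    ch, cw = int(H * ratio / 2), int(W * ratio / 2)
    f[..., H // 2 - ch:H // 2 + ch, W // 2 - cw:W // 2 + cw] = 0
    return torch.fft.ifft2(torch.fft.ifftshift(f, dim=(-2, -1))).real

backbone = nn.Conv2d(3, 64, 16, stride=16)  # stand-in for a frozen patch embedding
for p in backbone.parameters():
    p.requires_grad = False                 # pre-trained weights stay fixed
adapter = nn.Conv2d(64, 64, 1)              # the few tunable prompt parameters

x = torch.rand(1, 3, 224, 224)
tokens = backbone(x) + adapter(backbone(high_frequency(x)))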
- …